1. Introduction

Airbnb is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. It was born in 2007 and has since grown to 4 million Hosts who have welcomed more than 900 million guest arrivals in almost every country across the globe. The app works like most other booking apps: clients select the city they want to visit and the dates of their trip, and they receive a list of possible locations.

The main feature that makes Airbnb unique is that on the website you can find not only hotels and apartments, but also single rooms rented out for a few days by private Hosts.

In this environment Hosts are pushed to compete in an almost-free market, where they have to set the right price for their properties in order to stay competitive and win guests over. We can assume that a good percentage of private Hosts have no experience in either marketing or real estate, so defining the right price can become an entry barrier that stops many of them from ever trying to compete.

Our goal in this paper is to understand which components drive prices and to construct a model that can help both Hosts to set the right price for their properties and Guests to check whether a listing is fairly priced.

2. Data

We start our analysis by presenting our dataset. Before diving into the actual variables, we present the six cities and the number of records for each:

For a better understanding we split the variables into five different groups, presented below.

3. Definition of the dependent variable

While the main goal of this analysis is to find what drives prices, the main issue we need to address before diving in is: how do we define price?

We observe that there are different variables related to prices and fees, that are:

For the scope of this paper we define price as the per-person price of a seven-night stay at the property, computed as:

# Compute the per-person seven-night price and drop the now-redundant column
df <- df %>%
  mutate(Price = (7 * Price + Cleaning.Fee) / Accommodates) %>%
  select(-Cleaning.Fee)

4. Exploratory analysis

We now present our dataset in a more sophisticated way, focusing on the relationships between the collected variables and prices.

We start by showing how prices differ by city:

We can see from the two graphs above that 4 of the 6 cities are quite similar (Barcelona, Rome, Wien and Berlin), with an average seven-night per-person price of about 160€. The two “outliers” are London, with a value of 220€, and Amsterdam, which unexpectedly shows a price of almost 350€.

Because we collected a sample dataset of fewer than 10k observations, we know that our analysis cannot establish that Amsterdam is overall the most expensive city; in the comparison above we do not control for other factors, such as accommodation and service features.

Proceeding in the order used in the Data section, we will now try to assess whether the different groups of variables can be considered price drivers.

5. Correlation between variables

We will now show the qualitative and quantitative correlations between the variables of our dataset:

From the graph above we can see that there is a high correlation between the variables that describe the physical property; this is basically due to the fact that a high number of beds goes together with more bathrooms for the guests, higher square footage, and so on. This first interpretation is fairly basic and we will not dive deeper into it.

As spotted before, we can also see the small correlation between Host.Since and Number of Reviews, which could be explained by the fact that a property listed for a long time gets more guests and consequently more reviews.

From the graph above we can start drawing some key findings:

6. Supervised learning

After this exploratory analysis, which helped us better visualize and refine our dataset, we move to a supervised learning approach. The goal is to build a model that can accurately predict the price of a property given its features; such a model could serve different use cases, such as:

6.1 Basic models

Given the composition of our dataset, which contains both quantitative and qualitative variables, we decide to implement a tree-based algorithm. This kind of algorithm is definitely efficient on hybrid datasets and provides great interpretability; in this way we can both obtain an efficient model and understand which variables mainly drive Airbnb prices.

We start by estimating a tree over the whole dataset (80-20 train-test split).
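A minimal sketch of how this fit could look, assuming df_model holds the cleaned data; the seed value is an illustrative choice:

```r
library(rpart)

# 80-20 train-test split: train.full holds the training row indices
set.seed(42)
train.full <- sample(nrow(df_model), 0.8 * nrow(df_model))

# Fit a regression tree for Price on all other variables
tree.full <- rpart(Price ~ ., data = df_model, subset = train.full)
printcp(tree.full)  # complexity-parameter table for the fitted tree
```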

## 
## Regression tree:
## rpart(formula = Price ~ ., data = df_model, subset = train.full)
## 
## Variables actually used in tree construction:
## [1] Accommodates City         Room.Type   
## 
## Root node error: 113060445/7780 = 14532
## 
## n= 7780 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.296703      0   1.00000 1.00030 0.037903
## 2 0.025776      1   0.70330 0.70400 0.032991
## 3 0.021319      4   0.62597 0.62826 0.031799
## 4 0.010646      6   0.58333 0.58750 0.031629
## 5 0.010373      7   0.57269 0.57253 0.031530
## 6 0.010000      9   0.55194 0.57041 0.031993

From the tree structure we can easily identify the main drivers of our price variable:

  • City: the first split separates Amsterdam from the other cities because, as shown above, it presents relatively higher prices than the others. The second split carves out a different subtree for London which, as shown above, has prices lower than Amsterdam’s but still definitely higher than the other cities.
  • Room Type: all subtrees proceed by separating private/shared rooms from entire apartments.
  • Accommodates: finally, all trees split on the number of guests the property can hold, where, as discussed above, more people in the same property leads to a lower per-person price.

We now proceed to predict the values for our test set and evaluate them.
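A sketch of the evaluation step, assuming tree.full holds the fitted regression tree and train.full the training row indices:

```r
# Predict on the held-out 20% and compute the Root Mean Squared Error
pred.tree <- predict(tree.full, newdata = df_model[-train.full, ])
rmse.tree <- sqrt(mean((pred.tree - df_model$Price[-train.full])^2))
print(paste("Root Mean Squared Error: ", rmse.tree))
```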

## [1] "Root Mean Squared Error:  80.8827730952067"

We now estimate separate trees for the different cities to see whether the nodes differ:

From the trees above we can see that the six datasets produce trees that differ. The nodes present in all six trees are, unsurprisingly, room type and accommodates, which, as also seen above, seem to be the main price drivers. The other variables that appear, as expected, are mainly:

  • Host since: a more experienced host is always correlated with a higher price (as discussed above).
  • Air conditioning: the presence of air conditioning increases the price (by almost 50€!).
  • Security deposit: a high security deposit is correlated with a high price; this may be because properties that require a security deposit are more valuable.
  • Rating: as expected, a higher rating is correlated with a higher price.
  • # Bathrooms: having more than 1.5 bathrooms (at least 2) is correlated with a higher price.

6.2 Advanced models

Finally, we try to construct more robust models in which we aim for more accurate results at the expense of some interpretability.

6.2.1 Random Forest

We start by evaluating a Random Forest over our full dataset.
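A sketch of the fitting call, assuming df_model and the train.full indices from the split above (the seed is illustrative):

```r
library(randomForest)

# Regression forest with the package defaults (500 trees,
# mtry = p/3 variables tried at each split)
set.seed(42)
rf.full <- randomForest(Price ~ ., data = df_model, subset = train.full)
print(rf.full)  # residual variance and % variance explained
```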

## 
## Call:
##  randomForest(formula = Price ~ ., data = df_model, subset = train.full) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 8
## 
##           Mean of squared residuals: 6300.264
##                     % Var explained: 56.65

## [1] "Root Mean Squared Error:  73.1557067944934"

We can see that this kind of model explains only about 56% of the variability of our dataset, and it returns a Root Mean Squared Error of 73.16 (a major improvement over the 80.88 of the single-tree model).

6.2.2 Boosting

We proceed by applying boosting.
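A sketch of the boosted model, assuming the gbm package; the hyperparameters (number of trees, depth, shrinkage) are illustrative choices, not necessarily the ones used here:

```r
library(gbm)

# Gradient boosting for a continuous response (squared-error loss)
set.seed(42)
boost.full <- gbm(Price ~ ., data = df_model[train.full, ],
                  distribution = "gaussian", n.trees = 5000,
                  interaction.depth = 4, shrinkage = 0.01)
summary(boost.full)  # relative influence of each variable
```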

##                                       var      rel.inf
## City                                 City 44.244017976
## Accommodates                 Accommodates 13.864894540
## Room.Type                       Room.Type 10.122921249
## Review.Scores.Rating Review.Scores.Rating  5.231249883
## Security.Deposit         Security.Deposit  4.487169469
## Host.Since                     Host.Since  4.088197021
## Host.Response.Rate     Host.Response.Rate  2.995677382
## Bedrooms                         Bedrooms  2.704422681
## Property.Type               Property.Type  2.619921511
## Bathrooms                       Bathrooms  2.509292531
## Number.of.Reviews       Number.of.Reviews  1.928825245
## Air_conditioning         Air_conditioning  1.739716127
## Cancellation.Policy   Cancellation.Policy  0.755789815
## Maximum.Nights             Maximum.Nights  0.710782791
## Experiences.Offered   Experiences.Offered  0.576914834
## Host.Response.Time     Host.Response.Time  0.399784217
## Beds                                 Beds  0.366973051
## Instant_Bookable         Instant_Bookable  0.318356425
## Bed.Type                         Bed.Type  0.192047755
## Host.SuperHost             Host.SuperHost  0.043803663
## Washer                             Washer  0.034018054
## Host.verified               Host.verified  0.031642412
## Kitchen                           Kitchen  0.029493105
## Breakfast                       Breakfast  0.004088264
## Host.ProfilePic           Host.ProfilePic  0.000000000

## [1] "Root Mean Squared Error:  73.8598968588597"

We can see from the relative influence table shown above that the most relevant price drivers are the same ones we found in the single-tree model. As with the random forest, the Root Mean Squared Error improves on the single tree, at 73.86, although it remains slightly above the random forest’s 73.16.

6.3 Conclusions

We can conclude that, while these models greatly helped us interpret our dataset, they still seem to slightly underperform. One reason could be that our dataset is very heterogeneous, being composed of cities that differ greatly even in mean price level. Another reason could be that we still lack some of the variables needed to construct an accurate prediction model.

Some next steps to improve the performance of our model and perform a more complete analysis could be:

  • Add more relevant variables to the dataset, such as distance from the city center.
  • Implement a Convolutional Neural Network that can evaluate the photos of the property and assign a rating.
  • Perform a more accurate analysis over a time span as well, since summer/winter seasonality may directly affect the prices shown on the platform (unfortunately we did not find open data on this).

7. Unsupervised Learning

We now approach the problem with unsupervised learning models. This new approach could help us better understand the findings of the analysis above and also offer a different point of view on the dataset.

7.1 Principal Component Analysis

We start with the best-known unsupervised learning model, which can help us identify the principal components along which our data are spread out.

Before computing the model we recall that PCA can only be applied to quantitative variables; this means that even if we expect the results of this model to be aligned with our previous findings, we cannot fully rely on it to completely analyse our dataset.
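A minimal sketch of this step, assuming df_model holds the cleaned data with no missing values in its numeric columns:

```r
# PCA on the quantitative columns only, scaled to unit variance
num.vars <- df_model[sapply(df_model, is.numeric)]
pca <- prcomp(num.vars, center = TRUE, scale. = TRUE)

summary(pca)       # proportion of variance captured by each component
print(pca$rotation)  # loadings of each variable on the components
```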

From the plots above we can see that there is clearly one dimension along which the data are spread out the most, and it is the one related to the physical property (accommodates, beds, bathrooms and bedrooms).

We then have a second and a third dimension, each capturing less than half of the variability of the first. The second looks like a value-of-property dimension, given that its main components are Price, Rating and Host.Since (recall that we interpreted an experienced host as an added value); the third seems to be a popularity dimension, since its main component is the number of reviews.

This model is aligned with our first analysis, confirming that the variability of the dataset can be traced back to the groups of variables that we already presented as correlated.

Unfortunately we cannot see any clear cluster in the graph of the projected dimensions; this could easily be caused by the fact that all qualitative variables were left out of this model.

We show below that the k-means algorithm confirms our expectations: we are not able to identify clear clusters by looking at quantitative variables alone.
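A sketch of the clustering call, assuming df_model holds the cleaned data with no missing values in its numeric columns; k = 2 and the seed are illustrative:

```r
# k-means on the scaled quantitative variables only
set.seed(42)
num.vars <- df_model[sapply(df_model, is.numeric)]
km <- kmeans(scale(num.vars), centers = 2, nstart = 25)
table(km$cluster)  # size of each cluster
```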

As expected, the two clusters we get (which seem to try to separate small properties, with low numbers of beds and accommodates, from large ones) completely overlap.

7.2 Hierarchical Clustering

Given what we said above, we will now try to compute different clusters using hierarchical clustering; this way we can also add qualitative variables to our analysis, hopefully improving the model.

In order to evaluate the model with both quantitative and qualitative variables we use the Gower distance.
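A sketch of this pipeline, assuming the cluster package and a df_model data frame of mixed-type features; the complete-linkage choice is an illustrative assumption:

```r
library(cluster)  # provides daisy(), which implements the Gower distance
library(dplyr)

# Gower distance handles quantitative and qualitative columns together;
# City is dropped so the clusters are not trivially driven by it
gower.dist <- daisy(select(df_model, -City), metric = "gower")

# Hierarchical clustering on the resulting dissimilarity matrix
hc <- hclust(gower.dist, method = "complete")
clusters <- cutree(hc, k = 6)  # cut the dendrogram into 6 groups
```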

We start by computing 6 clusters (which we hope correspond to our 6 cities). We remove the variable City from this analysis in order to understand whether our cities actually differ.

From the graphs above we can see that there is no actual difference: the 6 clusters seem to overlap within each of the cities.

We now try the same method with a smaller number of clusters, in order to understand whether there are differences between properties that are not related to the city but could perhaps be related to different locations within the cities.

As can be seen from the graph above, we did not get the results we expected: the clusters seem to overlap in all the cities without any geographical meaning.

We will now define different clustering models, one for each city, in order to understand whether our data can be grouped in a way that has some geographical interpretation.

We can notice that in some cities (like London and Berlin) there seems to be a slight geographical interpretation of the clusters, but they still overlap in most parts of the city.

7.3 Conclusions

This unsupervised approach definitely helped us in the first part, identifying the principal components along which our data are spread out.

Unfortunately we did not obtain the results we expected from the clustering analysis; this could be because many variables (such as host-related ones) are not correlated with the construction of geographical clusters.

Some next steps for this kind of analysis would be:

  • As before, add some variables that could help us better cluster our data.
  • Work with non-geographical clusters in order to assess whether there is any other way in which our data can be grouped.
  • Work with fewer variables in order to better interpret and identify geographical clusters.